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Abstract —Attributing the culprit of a cyber-attack is widely 
considered one of the major technical and policy challenges 
of cyber-security. The lack of ground truth for an individual 
responsible for a given attack has limited previous studies. Here, 
we overcome this limitation by leveraging DEFCON capture-the- 
flag (CTF) exercise data where the actual ground-truth is known. 
In this work, we use various classification techniques to identify 
the culprit in a cyberattack and find that deceptive activities 
account for the majority of misclassified samples. We also explore 
several heuristics to alleviate some of the misclassification caused 
by deception. 

I. Introduction 

Attributing the culprit of a cyber-attack is widely con¬ 
sidered one of the major technical and policy challenges of 
cyber-security. The lack of ground truth for an individual 
responsible for a given attack has limited previous studies. In 
this study, we take an important first step toward developing 
computational techniques toward attributing the actual culprit 
(here hacking group) responsible for a given cyber-attack. We 
leverage DEFCON capture-the-fiag (CTF) exercise data which 
we have processed to be amenable to various machine learning 
approaches. Here, we use various classification techniques to 
identify the culprit in a cyber-attack and find that deceptive 
activities account for the majority of misclassified samples. We 
also explore several heuristics to alleviate some of the misclas¬ 
sification caused by deception. Our specific contributions are 
as follows: 

• We assemble a dataset of cyber-attacks with ground 
truth derived from the traffic of the CTF held at 
DEFCON 21 in 2013. 

• We analyze this dataset to identify cyber-attacks where 
deception occurred. 

• We frame cyber-attribution as a multi-label classifica¬ 
tion problem and leverage several machine learning 
approaches. We find that deceptive incidents account 
for the vast majority of misclassified samples. 

• We introduce several pruning techniques and show 
that they can reduce the effect of deception as well as 
provide insight into the conditions in which deception 
was employed by the participants of the CTF. 

In our text on cyber-warfare lO, we discuss the difficulties 
of cyber-attribution and how an intelligence analyst must also 


explore the deception hypothesis in a cyber-warfare scenario. 
When compared to other domains of warfare, there is a much 
greater potential for evidence found in the aftermath of cyber¬ 
attack to be planted by the adversary for purposes of deception. 
The policy implications of cyber-attribution have also been 
discussed in m where the authors point out that anonymity, 
ability to launch multi-stage attacks, and attack speed pose 
significant challenges to cyber attribution. 

In an early survey on cyber-attribution ||T|, the authors point 
out that technical attribution will generally identify machines, 
as opposed to a given hacker and his/her affiliations. While 
we will use technical information in our approach, we have 
ground truth data on the group involved by the nature of 
the capture-the-fiag data. This will allow our approach to 
profile the tactics, techniques, and procedures of a given group 
as we have ground-truth information on a hacking group as 
opposed to machines. An example of such an approach is the 
WOMBAT attribution method m which attributes behavior 
to IP sources that are potentially linked to some root cause 
determined through a clustering technique. Similarly, other 
work in combines cluster analysis with a component for multi¬ 
criteria decision analysis and studied an implementation of this 
approach using honeypot data - again, this approach lacks any 
ground truth of the actual hacker or hacking group. 

Concurrently, we have devised a formal logical framework 
for reasoning about cyber-attribution 0, Q. However, we 
have not studied how this framework can be instantiated on 
a real world dataset and, to date, we have not reported on an 
implementation or experiments in the literature. We note that 
none of the previous work on cyber-attribution leverages a 
data set with ground truth information of actual hacker groups 
- which is the main novelty of this paper. 

H. Dataset 

Our dataset consists of events recorded from a Capture- 
the-fiag (CTF) tournament held at DEFCON 21 in 2013. 
Briefly, CTF competitions act as educational exercise that 
exposes real world attack scenarios to participants. Network 
sniffing, analysis of protocols, programming and system level 
knowledge, cryptanalysis are some of the instrumental skills 
acquired by contestants. 

Our data represents attack/defense style, where each team 
owns a small network of machines to defend. Teams are judged 


based on scores given to attack machines of other teams as well 
as defending their own network. Initially, all virtual machines 
are configured with specific set of services. These services 
are vulnerable to state-of-art hacking techniques. Files can be 
considered as form of fiag to be captured from other teams or 
to be planted to other teams by exploiting those vulnerabilities. 

DEFCON CTF organizers recorded network traffic that 
includes network packets generating to and from all partici¬ 
pating teams and is available on Internet IH. Recordings are 
stored as archive files of PCAP (packet capture) for each team 
(destination as that team) separately. PCAP file contains packet 
headers (TCP, SSL, UDP etc.) and respective data as source, 
destination, sequence numbers etc. with timestamp having 
millisecond precision. Using open source tool tcpfio\\Q we 
interpreted collection of PCAPs as cumulative data streams. 
Tcpfiow reconstructs actual data streams from the packets 
that proved helpful in protocol analysis and debugging. This 
tool produces a file containing the contents of each stream, 
representing the data sent between two points in the CTF 
system. 

For each file, we computed an md5 checksum, a byte 
histogram, and an ARM instruction histogram. This data was 
recorded as a list of tuples (time-stamp, hash, byte-histogram, 
instruction-histogram) in a ISON document. These individual 
fields of the tuple are listed in Table 

TABLE 1: Fields in an instance of network attack 


Field 

Intuition 

byte_hist 

histogram of byte sequences in the payload 

inst_hist 

histogram of instructions used in the payload 

fromjeam 

the team where the payload originates (attacking team) 

to_team 

the team being attacked by the payload 

SVC 

the service that the payload is running 

payload_hash 

indicates the payload used in the attack (md5) 

time 

indicates the date and time of the attack 


From this pre-processing of the network data (packets) we 
have around 10 million network attacks. There are 20 teams 
in the CTF competition. In order to attribute an attack to a 
particular team, apart from analyzing the payloads used by the 
team, we also need to analyze the behavior of the attacking 
team towards his adversary. For this purpose we separate the 
network attacks according to the team being targeted. Thus 
we have 20 such subsets. We represent the 20 subsets (teams) 
as T-i, where i = 1, 2, 3...20. An example of an event in the 
dataset is shown in Table |2l 

TABLE 2: Example event from the dataset 


Field 

Value 

byte_hist 

0 X 43:245, 0 X 69:8, 0 X 3a:9, 0 X 5d:l, . 

inst_hist 

cmp:12 , svcmi:2, subs:8, movtmi:60. 

from_team 

men in black hats 

to_team 

Robot Mafia 

SVC 

02345 

payload_hash 

2cc03b4e0053cde24400bbd80890446c 

time 

2013-08-03X23:45:17 


A. Dataset Analysis 

We now discuss two important observations from the 
dataset, that makes the task of attributing an observed network 

^https://github.com/simsong/tcpflow 


attack to a team difficult. 

Deception: In the context of this paper we define an attack to 
be deceptive when multiple adversaries get mapped to a single 
attack pattern. In the current setting we define deception as the 
scenario when the same payload is used by multiple teams to 
target the same team. Fig. shows the distribution of unique 
deception attacks with respect to the total unique attacks in 
the dataset based on the target team. These unique deceptive 
attacks amount to just under 35% of the total unique attacks. 
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Fig. 1: Unique deceptive attacks directed towards each team. 

Duplicate attacks: A duplicate attack occurs when the same 
team uses the same payload to attack a team at different time 
instances. Duplicate attacks can be attributed to two reasons. 
First when a team is trying to compromise other systems, it 
just does not launch a single attack but a wave of attacks with 
very little time difference between consecutive attacks. Second, 
once a successful payload is created which can penetrate the 
defense of other systems, it is used more by the original 
attacker as well as the deceptive one as compared to other 
payloads. We group duplicates as being non-deceptive and 
deceptive. Non-deceptive duplicate are the duplicates of the 
team that first initiated the use of a particular payload. On the 
other hand deceptive duplicates are all the attacks from the 
teams that are being deceptive. Deceptive duplicates form a 
large portion of the dataset as seen in Fig. 
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Fig. 2: Total attacks and duplicate attacks (Deceptive and 
Non-deceptive) directed towards each team 


III. Baseline Approaches 

From the dataset, we have the ground truth available for 
all the samples. Hence we use supervised machine learning 


























































approaches to predict the attacking team. The ground truth 
corresponds to a team competing in the competition. 

Decision Tree (DT). For baseline comparisons we first im¬ 
plemented a decision tree classifier. We built the decision tree 
by finding the attribute that maximizes the information gain at 
each split. In order to avoid over-fitting, the terminating criteria 
is set to less than 0.1% of total samples. 

Random Forest (RF). We use a random forest which com¬ 
bines bagging for each tree with random feature selection at 
each node to split the data thus generating multiple decision 
tree classifiers. 

Support Vector Machine (SVM). Support vector machines 
is a popular supervised classification technique that works 
by finding a separating margin that maximizes the geometric 
distance between classes. We use the popular LibSVM imple¬ 
mentation 111 which is publicly available. 

Logistic Regression (LOG-REG). Logistic regression clas¬ 
sifies samples by computing the odds ratio. The odds ratio 
gives the strength of association between the features and the 
class. We implement the multinomial logistic regression which 
handles multi-class classification. 

A. Experimental Results 

For our baseline experiments, we separate the attacks based 
on the team being targeted. Thus we have 20 subsets. We then 
sort the attack according to time. We reserve the first 90% of 
the attacks for training and the rest 10% for testing. Attacker 
prediction accuracy is used as the performance measure for 
the experiment. Accuracy is defined as the fraction of correctly 
classified test samples. Fig.j^shows the accuracy for predicting 
the attacker for each target team. Machine learning techniques 
significantly outperform random guessing which would have 
an average accuracy of choosing 1 out of 19 teams attacking 
yielding an accuracy of 0.053. For this experiment random 
forest classifier performs better than logistic regression, sup¬ 
port vector machine and decision tree for all the target teams. 
Table below summarizes the average performance for each 
method. 
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Fig. 3: Team prediction accuracy for LOG-REG, RF, SVM 
and DT. 


B. Misclassified Samples 

Misclassification can be attributed to the following sources. 


TABLE 3: Summary of Prediction results averaged across all 
Teams 


Method 

Average Performance 

Decision tree (DT) 

0.26 

Logistic regression (LOG-REG) 

0.31 

Support vector machine (SVM) 

0.30 

Random Forest (RF) 

0.37 


• Non-deceptive duplicate attacks attributed to one of 
the deceptive teams. 

• Deceptive duplicates attributed to some other decep¬ 
tive team. 

• Payloads that were not encountered during the training 
phase. 

The first two sources of error make up the majority of 
misclassifications, since a given attack can be attributed to any 
of the 19 teams. 
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Fig. 4: Sources of error in the misclassified samples. 

Fig. shows the distribution of the above mentioned 
sources of misclassification for each team. Deceptive du¬ 
plicates form the majority of misclassifications. This is not 
surprising given the fact that deceptive duplicates make up 
almost 90% of the total attacks (see Fig. [^. 

IV. Pruning 

We explore different pruning techniques to address 
misclassification issues with respect to deceptive and non- 
deceptive duplicates. The pruning techniques are only applied 
to the training data, while the te st data is maintained at 10% 
as mentioned in Section IIII-AI We use the random forest 
classifier for all the pruning techniques. 

All-but-majority (P-1): In this pruning technique, for each 
payload, we only retain duplicates of the most frequent at¬ 
tacking team and prune the duplicates of all other teams. This 
pruned set is then used to train the random forest classifier. 
Table shows the classifier performance in comparison with 
the baseline method. All-but-majority pruning technique has 
better performance on the test set than the baseline approach 
for 11 out of 20 teams. Using this pruning technique does 
benefit majority of the teams as the prediction accuracy im¬ 
proves for them, but for some teams the performance drops. 



























































































The reason for the drop in performance for some teams is 
due to the fact that training set gets dominated by a single 
team which does not have majority in testing set. Since the 
majority team gets represented in most of the leaves of the 
random forest classifier, it gets predicted more often leading 
to high misclassifications. 

All-but-K-majority (P-2): In order to address the issue of 
one team dominating in the training set, we use the all-but-K- 
majority where we consider the K most frequent teams for a 
payload under consideration. After trying out different values 
of K we select K = 3, which gives the best performance. 
For higher values of K, the pruning behaves like the baseline 
approach and for lower values it behaves like All-but-majority. 
On average each team gains about 4QK samples in the training 
set as compared to all-but-majority pruning. Table shows 
the classifier performance. In this case also pruning performs 
better than baseline in 11 out of 20 teams, but as compared to 
all-but-majority the performance for most teams is better. 

All-but-earliest (P-3): For this pruning we only retain the 
duplicates of the team that initiated the attack using a particular 
payload. This pruning technique retains all the non-deceptive 
duplicates while getting rid of the deceptive ones. Table 
shows the classifier performance. This pruning technique per¬ 
forms better than the baseline approach for 8 out of 20 teams. 
Comparing this result to all-but-majority (including all-but- 
K-majority) pruning indicates that deceptive duplicates are 
informative in attributing an attack to a team and should not 
be ignored completely. 

All-but-most-recent (P-4): In this pruning we repeat a similar 
procedure like All-but-earliest but instead of retaining the 
duplicates of the team that initiated an attack, we retain the 
duplicates of the team that used the payload last in the training 
set. Since the data is sorted according to time, the last attacker 
becomes the most recent attacker for the test set. Table [4] shows 
the classifier performance. 

TABLE 4: Pruning technique performance comparison. 


Teams 

RF 

P-l(RF) 

P-2(RF) 

P-3(RF) 

P-4(RF) 

T-1 

0.45 

0.16 

0.46 

0.15 

0.15 

1-2 

0.22 

0.28 

0.30 

0.15 

0.14 

T-3 

0.30 

0.53 

0.29 

0.57 

0.57 

T-4 

0.26 

0.33 

0.27 

0.31 

0.32 

T-5 

0.26 

0.38 

0.45 

0.40 

0.42 

T-6 

0.50 

0.27 

0.24 

0.31 

0.26 

1-7 

0.45 

0.59 

0.58 

0.19 

0.49 

T-8 

0.42 

0.52 

0.52 

0.51 

0.55 

T-9 

0.41 

0.65 

0.68 

0.52 

0.53 

T-10 

0.30 

0.54 

0.34 

0.55 

0.57 

T-11 

0.37 

0.27 

0.35 

0.27 

0.29 

T-12 

0.24 

0.37 

0.37 

0.25 

0.22 

T-13 

0.35 

0.27 

0.37 

0.29 

0.27 

T-14 

0.42 

0.27 

0.40 

0.30 

0.30 

T-15 

0.30 

0.20 

0.27 

0.21 

0.20 

T-16 

0.42 

0.28 

0.22 

0.32 

0.31 

T-17 

0.43 

0.45 

0.35 

0.43 

0.40 

T-18 

0.48 

0.39 

0.43 

0.41 

0.40 

T-19 

0.41 

0.65 

0.58 

0.54 

0.60 

T-20 

0.48 

0.16 

0.16 

0.16 

0.17 


TABLE 5: Summary of Prediction results averaged across all 
Teams 


Method 

Average Performance 

Baseline Approach (RF) 

0.37 

All-but-majority Pruning (RF) 

0.40 

All-but-K-majority Pruning (RF) 

0.42 

All-but-earliest Pruning (RF) 

0.34 

All-but-most-recent Pruning (RF) 

0.36 


V. Conclusion 

In this paper, we study cyber-attribution by examining 
DEFCON CTF data - which provides us with ground-truth 
on the culprit responsible for each attack. We frame cyber¬ 
attribution as a classification problem and examine it using 
several machine learning approaches. We find that deceptive 
incidents account for the vast majority of misclassified samples 
and introduce heuristic pruning techniques that alleviate this 
problem somewhat. Moving forward, we look to employ a 
more principled approach to counter deception based on our 
previously established theoretical framework for reasoning 
about cyber-attribution I|5|,l2l . In particular we wish to employ 
temporal reasoning to tackle the problem of deceptive attacks. 
This opens up interesting research questions in particular iden¬ 
tifying hacking group from a series of attacks over a period of 
time, differentiating between deceptive hacking groups in time 
series data. This is a knowledge engineering challenge which 
calls for development of efficient and scalable algorithms. 

VI. ACKNOWLEDGMENT 

Some of this work was supported by the U.S. Office of 
Naval Research and ASU Global Security Initiative (GSI). 

References 

[1] W. E. Boebert. A survey of challenges in attribution. In Proceedings of 
a workshop on Deterring CyberAttacks, pages 41-54, 2010. 

[2] C.-C. Chang and C.-J. Lin. Libsvm: A library for support vector 
machines. ACM Transactions on Intelligent Systems and Technology 
(TIST), 2(3):27, 2011. 

[3] M. Dacier, V.-H. Pham, and O. Thonnard. The wombat attack attribution 
method: some results. In Information Systems Security, pages 19-37. 
Springer, 2009. 

[4] DEFCON. Defcon: Capture the flag. 2013. 

[5] S. Jajodia, P. Shakarian, V. S. Subrahmanian, V. Swarup, and C. Wang. 
Cyber Warfare: Building the Scientific Foundation. Springer Publishing 
Company, Incorporated, 2015. 

[6] P. Shakarian, J. Shakarian, and A. Ruef. Introduction to cyber-warfare: 
A multidisciplinary approach. Newnes, 2013. 

[7] P. Shakarian, G. I. Simari, and M. A. Falappa. Belief revision in 
structured probabilistic argumentation. In Foundations of Information 
and Knowledge Systems, pages 324-343. Springer, 2014. 

[8] O. Thonnard, W. Mees, and M. Dacier. On a multicriteria clustering 
approach for attack attribution. ACM SIGKDD Explorations Newsletter, 
12(l):ll-20, 2010. 

[9] N. Tsagourias. Nicolas politis initiatives to outlaw war and deflne 
aggression, and the narrative of progress in international law. European 
Journal of International Law, 23(l):255-266, 2012. 


Table gives the summary of the prediction results for 
all the pruning techniques in comparison with the random 
forest baseline approach. In the pruning techniques All-but- 
K-majority works best with an average accuracy of 0.42. 

















